    Clinical Text Mining: Secondary Use of Electronic Patient Records

    This open access book describes the results of natural language processing and machine learning methods applied to clinical text from electronic patient records. It is divided into twelve chapters. Chapters 1-4 discuss the history and background of the original paper-based patient records, their purpose, and how they are written and structured. These initial chapters do not require any technical or medical background knowledge. The remaining eight chapters are more technical in nature and describe various medical classifications and terminologies such as ICD diagnosis codes, SNOMED CT, MeSH, UMLS, and ATC. Chapters 5-10 cover basic tools for natural language processing and information retrieval, and how to apply them to clinical text. The differences between rule-based and machine learning-based methods, and between supervised and unsupervised machine learning methods, are also explained. Next, ethical concerns regarding the use of sensitive patient records for research purposes are discussed, including methods for de-identifying electronic patient records and safely storing patient records. The book’s closing chapters present a number of applications in clinical text mining and summarise the lessons learned from the previous chapters. The book provides a comprehensive overview of technical issues arising in clinical text mining, and offers a valuable guide for advanced students in health informatics, computational linguistics, and information retrieval, and for researchers entering these fields.

    De-identifying Swedish clinical text - refinement of a gold standard and experiments with Conditional random fields

    In order to perform research on the information contained in Electronic Patient Records (EPRs), access to the data itself is needed. This is often very difficult due to confidentiality regulations. The data sets need to be fully de-identified before they can be distributed to researchers. De-identification is a difficult task where the definitions of annotation classes are not self-evident. We present work on the creation of two refined variants of a manually annotated gold standard for de-identification, one created automatically and one created through discussions among the annotators. These are used for the training and evaluation of an automatic system based on the Conditional Random Fields algorithm. Evaluating with four-fold cross-validation on sets of around 4,000-6,000 annotation instances, we obtained very promising results for both gold standards: an F-score around 0.80 for a number of experiments, with higher results for certain annotation classes. Moreover, 49 system predictions initially counted as false positives were verified to be true positives that the annotators had missed. Our intention is to make this gold standard available to other research groups in the future. Despite being slightly more time-consuming to produce, we believe the manual consensus gold standard is the most valuable for further research. We also propose a set of annotation classes to be used for similar de-identification tasks.
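
    A minimal sketch of how a CRF-based de-identifier of this kind could be set up, here with the sklearn-crfsuite library on toy data; the feature set and the PHI label classes below are illustrative assumptions rather than the ones from the study, which also evaluates with four-fold cross-validation instead of on the training data.

        # Sketch: CRF sequence labelling for de-identification (toy data).
        import sklearn_crfsuite
        from sklearn_crfsuite import metrics

        # Toy sentences with illustrative PHI labels (not the study's classes).
        sentences = [["Anna", "visited", "Karolinska"], ["The", "patient", "recovered"]]
        labels = [["B-PERSON", "O", "B-LOCATION"], ["O", "O", "O"]]

        def token_features(sent, i):
            word = sent[i]
            return {
                "lower": word.lower(),
                "suffix3": word[-3:],
                "is_capitalised": word[0].isupper(),
                "is_digit": word.isdigit(),
                "prev_lower": sent[i - 1].lower() if i > 0 else "<BOS>",
            }

        X = [[token_features(s, i) for i in range(len(s))] for s in sentences]
        crf = sklearn_crfsuite.CRF(algorithm="lbfgs", max_iterations=100)
        crf.fit(X, labels)
        print(metrics.flat_f1_score(labels, crf.predict(X), average="weighted"))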

    Data Migration Between Web Content Management Systems

    Web Content Management Systems (WCMS) have become necessary tools for today’s web-oriented business world. The migration of data between Web Content Management Systems is consequently an issue that more and more companies and organizations have to deal with when changing from an older WCMS to a newer one. This article examines the migration options and offers suggestions to help with the choice of an appropriate method for migrating content from one Web Content Management System to another. The research is supported by a survey conducted with a number of large companies and organizations. The conclusions can be taken into consideration when evaluating the different data migration approaches.

    Using Uplug and SiteSeeker to construct a cross language search engine for Scandinavian languages

    Proceedings of the 17th Nordic Conference of Computational Linguistics NODALIDA 2009. Editors: Kristiina Jokinen and Eckhard Bick. NEALT Proceedings Series, Vol. 4 (2009), 26-33. © 2009 The editors and contributors. Published by the Northern European Association for Language Technology (NEALT), http://omilia.uio.no/nealt. Electronically published at Tartu University Library (Estonia), http://hdl.handle.net/10062/9206.

    Releasing a Swedish Clinical Corpus after Removing all Words - De-identification Experiments with Conditional Random Fields and Random Forests

    Patient records contain valuable information in the form of both structured data and free text; however, this information is sensitive since it can reveal the identity of patients. In order to allow new methods and techniques to be developed and evaluated on real-world clinical data without revealing such sensitive information, researchers could be given access to de-identified records without protected health information (PHI), such as names, telephone numbers, and so on. One approach to minimizing the risk of revealing PHI when releasing text corpora from such records is to include only features of the words instead of the words themselves. Such features may include parts of speech, word length, and so on, from which the sensitive information cannot be derived. In order to investigate what performance losses can be expected when replacing specific words with features, an experiment with two state-of-the-art machine learning methods, conditional random fields and random forests, is presented, comparing their ability to support de-identification, using the Stockholm EPR PHI corpus as a benchmark test. The results indicate severe performance losses when the actual words are removed, leading to the conclusion that the chosen features are not sufficient for the suggested approach to be viable.
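
    The word-to-feature replacement can be illustrated with a small sketch; the features below (token length, character shape, digit and hyphen flags) are illustrative assumptions that only approximate the part-of-speech and surface features examined in the paper.

        # Sketch: map each token to surface features from which the word itself
        # cannot be recovered, so the feature corpus could be released instead
        # of the raw text. Feature choice here is an illustrative assumption.
        import re

        def word_shape(token):
            # e.g. "Karolinska" -> "Xxx", "08-1234" -> "dd-dd" (runs collapsed)
            shape = re.sub(r"[A-ZÅÄÖ]", "X", token)
            shape = re.sub(r"[a-zåäö]", "x", shape)
            shape = re.sub(r"\d", "d", shape)
            return re.sub(r"(.)\1{2,}", r"\1\1", shape)

        def deidentified_features(token):
            return {
                "length": len(token),
                "shape": word_shape(token),
                "is_digit": token.isdigit(),
                "has_hyphen": "-" in token,
            }

        print(deidentified_features("Karolinska"))
        # {'length': 10, 'shape': 'Xxx', 'is_digit': False, 'has_hyphen': False}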

    Uncertainty Detection as Approximate Max-Margin Sequence Labelling

    This paper reports experiments for the CoNLL 2010 shared task on learning to detect hedges and their scope in natural language text. We have addressed the experimental tasks as supervised linear maximum margin prediction problems. For sentence-level hedge detection in the biological domain we use an L1-regularised binary support vector machine, while for sentence-level weasel detection in the Wikipedia domain we use an L2-regularised approach. We model the in-sentence uncertainty cue and scope detection task as an L2-regularised approximate maximum margin sequence labelling problem, using the BIO encoding. In addition to surface-level features, we use a variety of linguistic features based on a functional dependency analysis. A greedy forward selection strategy is used in exploring the large set of potential features. Our official results for Task 1 are an F1-score of 85.2 for the biological domain and 55.4 for the Wikipedia domain. For Task 2, our official result is 2.1 for the entire task, with a score of 62.5 for cue detection. After resolving errors and final bugs, our final results are, for Task 1, biological: 86.0 and Wikipedia: 58.2; and for Task 2, scopes: 39.6 and cues: 78.5.
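
    A minimal sketch of the sentence-level hedge detection setup as binary classification with an L1-regularised linear SVM, here via scikit-learn on toy data; the learner and the dependency-based features used in the paper are considerably richer.

        # Sketch: sentence-level hedge detection as binary classification with
        # an L1-regularised linear SVM. Toy data; illustrative features only.
        from sklearn.feature_extraction.text import TfidfVectorizer
        from sklearn.pipeline import make_pipeline
        from sklearn.svm import LinearSVC

        sentences = [
            "The protein may be involved in apoptosis.",   # hedged
            "This could suggest a regulatory role.",       # hedged
            "The protein binds DNA.",                      # certain
            "We measured expression in all samples.",      # certain
        ]
        labels = [1, 1, 0, 0]  # 1 = uncertain/hedged, 0 = certain

        clf = make_pipeline(
            TfidfVectorizer(ngram_range=(1, 2)),
            LinearSVC(penalty="l1", dual=False),  # L1 regularisation
        )
        clf.fit(sentences, labels)
        print(clf.predict(["The results may indicate an interaction."]))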

    The Influence of NegEx on ICD-10 Code Prediction in Swedish: How is the Performance of BERT and SVM Models Affected by Negations?

    Clinical text contains many negated concepts, since the physician excludes irrelevant symptoms when reasoning about and concluding on the diagnosis. This study investigates the machine interpretation of negated symptoms and diagnoses using a rule-based negation detector, and its influence on a downstream text classification task. The study focuses on the effect of negated concepts and NegEx preprocessing on classifier performance when predicting the ICD-10 gastro-surgical codes assigned to discharge summaries. Based on the experiments, NegEx preprocessing resulted in a slight performance improvement for the traditional machine learning model (SVM) and had no effect on the performance of the deep learning model KB/BERT.
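
    A minimal sketch of the NegEx idea on toy Swedish text; the trigger list and the fixed scope window below are illustrative assumptions, whereas the real NegEx also handles scope-terminating terms and pseudo-negations.

        # Sketch of NegEx-style rule-based negation marking: concepts within a
        # fixed window after a negation trigger are prefixed so a downstream
        # classifier sees "NEG_feber" rather than "feber". Trigger list and
        # window size are illustrative assumptions.
        NEG_TRIGGERS = {"ingen", "inga", "inte", "utan", "ej"}
        WINDOW = 5  # number of tokens after a trigger treated as negated

        def negex_mark(tokens):
            marked, scope = [], 0
            for tok in tokens:
                if tok.lower() in NEG_TRIGGERS:
                    scope = WINDOW
                    marked.append(tok)
                elif scope > 0:
                    marked.append("NEG_" + tok)
                    scope -= 1
                else:
                    marked.append(tok)
            return marked

        print(negex_mark("Patienten har ingen feber eller buksmärta".split()))
        # ['Patienten', 'har', 'ingen', 'NEG_feber', 'NEG_eller', 'NEG_buksmärta']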

    Detecting hospital-acquired infections : A document classification approach using support vector machines and gradient tree boosting

    Hospital-acquired infections pose a significant risk to patient health, while their surveillance is an additional workload for hospital staff. Our overall aim is to build a surveillance system that reliably detects all patient records that potentially include hospital-acquired infections, in order to reduce the burden of having the hospital staff manually check patient records. This study focuses on applying text classification with support vector machines and gradient tree boosting to the problem. Support vector machines and gradient tree boosting have never before been applied to the problem of detecting hospital-acquired infections in Swedish patient records, and according to our experiments they lead to encouraging results. The best result is yielded by gradient tree boosting, at 93.7% recall, 79.7% precision, and 85.7% F1-score when using stemming. We show that simple preprocessing techniques and parameter tuning can lead to high recall (which we aim for when screening patient records) with appropriate precision for this task.
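
    A sketch of the document-classification setup with stemming followed by gradient tree boosting, using scikit-learn and NLTK's Swedish Snowball stemmer on toy data; the corpus, feature engineering, and tuned parameters in the study differ.

        # Sketch: stemmed bag-of-words + gradient tree boosting for flagging
        # records that may describe hospital-acquired infections (toy data).
        from nltk.stem.snowball import SnowballStemmer
        from sklearn.ensemble import GradientBoostingClassifier
        from sklearn.feature_extraction.text import TfidfVectorizer
        from sklearn.pipeline import make_pipeline

        stemmer = SnowballStemmer("swedish")

        def stem_tokens(text):
            return [stemmer.stem(t) for t in text.split()]

        docs = [
            "infektion efter operation antibiotika insatt",  # HAI-positive
            "sårinfektion vid kateter och feber",            # HAI-positive
            "planerad kontroll utan anmärkning",             # negative
            "normal postoperativ läkning",                   # negative
        ]
        labels = [1, 1, 0, 0]

        clf = make_pipeline(
            TfidfVectorizer(tokenizer=stem_tokens, token_pattern=None),
            GradientBoostingClassifier(n_estimators=100),
        )
        clf.fit(docs, labels)
        print(clf.predict(["feber och infektion efter kateter"]))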